

CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence

Cheng, Yutong, Liu, Yang, Li, Changze, Song, Dawn, Gao, Peng

arXiv.org Artificial Intelligence

Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI, which calls for benchmarks that can rigorously evaluate their performance. Several early efforts have studied LLMs on some CTI tasks but remain limited: (i) they adopt only closed-book settings, relying on parametric knowledge without leveraging CTI knowledge bases; (ii) they cover only a narrow set of tasks, lacking a systematic view of the CTI landscape; and (iii) they restrict evaluation to single-source analysis, unlike realistic scenarios that require reasoning across multiple sources. To fill these gaps, we present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI under knowledge-augmented settings. CTIArena spans three categories, structured, unstructured, and hybrid, further divided into nine tasks that capture the breadth of CTI analysis in modern security operations. We evaluate ten widely used LLMs and find that most struggle in closed-book setups but show noticeable gains when augmented with security-specific knowledge through our designed retrieval-augmented techniques. These findings highlight the limitations of general-purpose LLMs and the need for domain-tailored techniques to fully unlock their potential for CTI.
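As a generic illustration of the knowledge-augmented setting the abstract describes (this is not the paper's pipeline; the knowledge-base entries and the bag-of-words scorer are hypothetical stand-ins), a minimal sketch of retrieving one knowledge-base entry to prepend to an LLM prompt:

```python
import numpy as np

# Toy CTI knowledge base (hypothetical entries, not from the paper)
kb = [
    "T1566 phishing: adversaries send spearphishing messages to gain access",
    "T1059 command and scripting interpreter: abuse of shells and scripts",
    "T1486 data encrypted for impact: ransomware encrypts victim files",
]
query = "ransomware encrypted the files on the victim host"

def bow(texts):
    # bag-of-words vectors over the shared vocabulary
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            mat[row, index[w]] += 1.0
    return mat

mat = bow(kb + [query])
docs, q = mat[:-1], mat[-1]
# cosine similarity between the query and each knowledge-base entry
sims = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
retrieved = kb[int(sims.argmax())]
# `retrieved` would be prepended to the LLM prompt as external CTI context
```

A real system would use learned embeddings and a vector index rather than word counts, but the retrieve-then-augment shape is the same.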


Stat Stories: Normalizing Flows as an Application of Variable Transformation

#artificialintelligence

While other statistical methods such as Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) have achieved impressive results on difficult tasks such as learning the distributions of images and other complicated datasets, they do not allow exact density estimation, i.e., calculating the probability density of new data points. In this sense, Normalizing Flows prove to be elegant. The method can perform density estimation and sampling as well as variational inference. Consider a transformation u = g(x; θ), i.e., g is parametrized by a parameter vector θ.
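The change-of-variables idea behind the transformation u = g(x; θ) can be sketched numerically. A minimal sketch, assuming a standard normal base density and a simple invertible affine map (the values of a and b are arbitrary); the density of u follows from the inverse map and its Jacobian:

```python
import numpy as np

def base_log_density(x):
    # log density of the standard normal base distribution p_x
    return -0.5 * (x ** 2 + np.log(2.0 * np.pi))

# Affine transformation u = g(x; theta) with theta = (a, b), a != 0
a, b = 2.0, 1.0

def g(x):
    return a * x + b

def g_inv(u):
    return (u - b) / a

def log_density_u(u):
    # change of variables:
    # log p_u(u) = log p_x(g^{-1}(u)) + log |d g^{-1}(u) / du|
    return base_log_density(g_inv(u)) + np.log(1.0 / abs(a))

# Sampling is just pushing base samples through g
rng = np.random.default_rng(0)
samples = g(rng.standard_normal(100_000))
```

Because g is invertible with a tractable Jacobian, the same θ supports both exact density evaluation and sampling, which is the property the excerpt contrasts with GANs and VAEs.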


Understand Weight of Evidence and Information Value! - Analytics Vidhya

#artificialintelligence

We have all built a logistic regression model at some point in our lives. Even if we have never built one, we have definitely learned this predictive modeling technique in theory. Two simple, undervalued concepts used in the preprocessing step of building a logistic regression model are weight of evidence and information value. I would like to bring them back into the limelight through this article. First things first: we all know logistic regression solves a classification problem.
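A minimal sketch of computing weight of evidence and information value for an already-binned feature against a binary target. Note the sign convention for WoE varies across references; this sketch uses WoE = ln(% non-events / % events), as is common in credit scoring:

```python
import numpy as np
import pandas as pd

def woe_iv(bins, target):
    """Weight of evidence per bin and total information value.

    `target` is binary (1 = event, 0 = non-event). Assumes every bin
    contains at least one event and one non-event, so no log(0) occurs.
    """
    df = pd.DataFrame({"bin": bins, "y": target})
    g = df.groupby("bin")["y"].agg(events="sum", total="count")
    g["non_events"] = g["total"] - g["events"]
    # share of all events / all non-events that fall in each bin
    pct_event = g["events"] / g["events"].sum()
    pct_non = g["non_events"] / g["non_events"].sum()
    g["woe"] = np.log(pct_non / pct_event)
    # information value sums the WoE weighted by the distribution gap
    iv = ((pct_non - pct_event) * g["woe"]).sum()
    return g[["events", "non_events", "woe"]], iv
```

The per-bin WoE values can replace the raw categories as a monotone numeric encoding, and the IV gives a quick screen of the feature's predictive strength.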


Learning physics-based reduced-order models for a single-injector combustion process

Swischuk, Renee, Kramer, Boris, Huang, Cheng, Willcox, Karen

arXiv.org Machine Learning

This paper presents a physics-based data-driven method to learn predictive reduced-order models (ROMs) from high-fidelity simulations, and illustrates it in the challenging context of a single-injector combustion process. The method combines the perspectives of model reduction and machine learning. Model reduction brings in the physics of the problem, constraining the ROM predictions to lie on a subspace defined by the governing equations. This is achieved by defining the ROM in proper orthogonal decomposition (POD) coordinates, which embed the rich physics information contained in solution snapshots of a high-fidelity computational fluid dynamics (CFD) model. The machine learning perspective brings the flexibility to use transformed physical variables to define the POD basis. This is in contrast to traditional model reduction approaches that are constrained to use the physical variables of the high-fidelity code. Combining the two perspectives, the approach identifies a set of transformed physical variables that expose quadratic structure in the combustion governing equations and learns a quadratic ROM from transformed snapshot data. This learning does not require access to the high-fidelity model implementation. Numerical experiments show that the ROM accurately predicts temperature, pressure, velocity, species concentrations, and the limit-cycle amplitude, with speedups of more than five orders of magnitude over high-fidelity models. Moreover, ROM-predicted pressure traces accurately match the phase of the pressure signal and yield good approximations of the limit-cycle amplitude.
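A minimal sketch of the POD step the abstract builds on, assuming a snapshot matrix whose columns are high-fidelity solutions (random low-rank stand-in data here; a real CFD code would supply the snapshots):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical snapshot matrix: each column is one high-fidelity solution.
n_dof, n_snap = 500, 40
snapshots = rng.standard_normal((n_dof, 5)) @ rng.standard_normal((5, n_snap))

# POD basis = left singular vectors of the mean-centered snapshot matrix
mean = snapshots.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(snapshots - mean, full_matrices=False)

r = 5  # number of retained modes, chosen from the singular-value decay
basis = U[:, :r]  # (n_dof, r), orthonormal columns

# Reduced (POD) coordinates of the snapshots and their reconstruction
q = basis.T @ (snapshots - mean)
reconstruction = mean + basis @ q
```

The ROM then evolves the low-dimensional coordinates q instead of the full state; the paper's contribution is choosing transformed physical variables so that this evolution has quadratic structure learnable from the snapshot data.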


Feature Engineering in Python: Variable distribution

#artificialintelligence

Linear regression is a common technique used in association studies between a targeted outcome and potential risk factors (e.g., age, sex). Violation of the normality assumption can sometimes be attributed to the skewed nature of the dependent variable and may be a concern for naturally skewed outcome variables, such as best corrected visual acuity [1], refractive error [2], and Rasch scores. Normality violation affects the estimates of the standard error (SE) and the confidence interval, and hence the significance of the risk factors. Nonparametric regression models or bootstrap techniques are often suggested because they provide more robust estimates of the SE. However, nonparametric techniques require large sample sizes to support the model structure and are very sensitive to outliers. Thus, a key question is whether simple linear regression modeling is still valid if the normality assumption is violated.
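A common feature-engineering remedy for a skewed variable is a log transform. A minimal sketch, using a hypothetical log-normal outcome, of checking skewness before and after the transform:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed outcome variable (log-normal stand-in)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(x):
    # sample skewness: third standardized moment
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# A log transform pulls in the long right tail
# (use np.log1p instead when the outcome can be zero)
y_log = np.log(y)
```

Whether to transform or instead rely on robust SEs depends on the question: the transformed model estimates effects on the log scale, which changes the interpretation of the coefficients.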


A Comprehensive Guide to Data Exploration

@machinelearnbot

There are no shortcuts for data exploration. If you believe that machine learning can sail you away from every data storm, trust me, it won't. At some point, you'll realize that you are struggling to improve your model's accuracy. In such situations, data exploration techniques will come to your rescue. I can say this confidently, because I've been through such situations a lot.